Notebook for MTAT.03.183 Data Mining project

Car rental demand detection based on price requests from online channels

This notebook is in addition to Business Understanding, Data Understanding, Project Planning file in project repository. There is also another R file which solves more questions.

The notebook is split into several parts. The questions of the project which have been analysed here are the following:

1) To which time period customers are looking for rental cars now/yesterday/last week/last month?

2) Are there unexpected changes (peaks, drops) in bookings or price requests?

3) How many price requests could be cached (have same pickup and return location, source country and driver age) for 1 hour, 3 hours, 12 hours?

4) What is actual demand after eliminating duplicated requests?

Tools used

In this notebook Pandas is mainly used. Also making use of plotly to make interactive graphs. Seaborn for non-interactive graphs. In some graphs also cufflinks was used for easier implementation. Note that we learned new stuff during creation of this notebook, so a lot of things could be written in a easier manner...

Datasets used

Rate quote data. In this notebook we use the first file we were given, around 23 GB of rate quote data from 01.12.2017 to 03.12.2017.

Reservations data. Making use of all reservations data we were given.

The description of columns and values in the data set are described in "Business Understanding, Data Understanding, Project Planning" file.

File with information to map contract_id's with brokers.

Data. Reading and cleaning data

We saw that that the data itself was really clean. At least the columns that we required in this project. In total there was more than 80 GB of request data and 32,000 KB reservations data.

In [76]:
import pandas as pd
import csv
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (14, 6)
%load_ext autoreload
%autoreload 2
Populating the interactive namespace from numpy and matplotlib
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [19]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import graph_objs as go
init_notebook_mode()
In [20]:
fields = ["timestamp", "pickup_timestamp", "return_timestamp", "broker_contract_id", "driver_age", 
          "request_uuid", "source_country_region_id", "pickup_desk_id", "return_desk_id", "rental_days"]
df = pd.read_csv("rate_quote.csv", skipinitialspace=True, usecols=fields)
print(df.shape)
(2997586, 10)

We see that first file had close to 3 million rows.

Converting timestamp rows to datetime datatype. Pandas does it neatly for us.

In [21]:
df["timestamp"] = pd.to_datetime(df['timestamp'])
df["pickup_timestamp"] = pd.to_datetime(df['pickup_timestamp'])
df["return_timestamp"] = pd.to_datetime(df['return_timestamp'])

Merging rate quote data with info about brokers. We add Broker_name column to our data with this.

In [22]:
# Merge with broker contract name
fields = ["Contract_ID", "Contract_name", "Broker name"]
contract_df = pd.read_csv("broker_contracts.csv", skipinitialspace=True, usecols=fields)

print(contract_df.shape)
print(contract_df.dtypes)

contract_df.columns = ["broker_contract_id", "Contract_name", "Broker_name"]
    
    
merged_requests = pd.merge(df, contract_df, on=["broker_contract_id"])

print(merged_requests.dtypes)
(27, 3)
Contract_ID       int64
Contract_name    object
Broker name      object
dtype: object
timestamp                   datetime64[ns]
pickup_desk_id                       int64
return_desk_id                       int64
pickup_timestamp            datetime64[ns]
return_timestamp            datetime64[ns]
rental_days                          int64
broker_contract_id                   int64
source_country_region_id             int64
driver_age                           int64
request_uuid                        object
Contract_name                       object
Broker_name                         object
dtype: object

What is the actual demand after eliminating duplicated requests?

One of the questions of the project was to see how many of the requests are duplicates and how does the demand change after removing the duplicates.

When exploring general demand we do not care about car specific columns and therefore we ignore them.

In general there were two ways to detect duplicates:

1) According to rate quotes data's "request_uuid" column. When a broker asks for prices then it makes several requests, each for several cars at a time. Requests with same contract share the request_uuid field.

2) Since there could be several different contracts for a broker then when broker asks for prices there will be requests with different request_uuid's, but still actually would count as one "demand".

Solving

For first case it is easy to answer. Since when we are exploring demands we want to remove all duplicates with same request_uuid. In this case we just drop duplicates according to request_uuid and just take the first one. Since we do not care about car specific columns then we can just ignore them. Pickup_time, return_time, source country et cetera will still be same for all of them.

In order to solve the case where there are different contracts the way we can look at it is to see requests which are made close together. If two requests which are timewise close together and share all the other values but request_uuid then they probable are made together in same request. Dependent on how close we require, timewise, we ask the requests it be there will be more or less false positives.

Phase 1 - removing duplicates using request_uuid

Group data by request_uuid, for each find minimal timestamp and set it as aggregate row's timestamp. We can use drop_duplicates, which will get first of each (we make the assumption that the data is in chronological order)..

In [23]:
# Used first one to make sure that the timestamp we pick is the first one. 

## Two ways to do it. Second one guarantees even if the data is not in chronological order
first_from_each_requestuuid = merged_requests.drop_duplicates("request_uuid")
#first_from_each_requestuuid = merged_requests[merged_requests.groupby('request_uuid')['timestamp'].rank() == 1 ]

How many removed and remaining?

Before: 2997586 After: 130915 Decrease of 95.63265240763735%

We see that there are A LOT of duplicates based already only on request_uuid. With this we can say that brokers on average request data on 23 cars (1/(1-0.95632)).

In [24]:
before_count = len(merged_requests)
after_count =  len(first_from_each_requestuuid)

print("Before:", before_count, "After:", after_count)
print("Decrease of {}%".format((100-(after_count/before_count)*100)))
Before: 2997586 After: 130915
Decrease of 95.63265240763735%

Saving a to new file

In [25]:
## Saving data frame (without duplicates by requestuuid)
first_from_each_requestuuid.to_csv("rate_quote_1_dup2.csv")

Phase 2 - Removing extra duplicates

Different request_uuid, but other parameters same (nearby timestamp)

1) Group by broker name

2) Group by all other columns which should remain the same: (["pickup_timestamp", "return_timestamp", "driver_age", "source_country_region_id", "pickup_desk_id", "return_desk_id", "rental_days"]

3) Check for orders within some set time difference, if other parameters except contract_id / request_uuid are the same then merge to be one request.

Currently the MAX_TIME_DIFFERENCE we are looking at is MAX_TIME_DIFFERENCE = 200000 # How many microseconds

In [26]:
## Group by broker_name
grouped_by_broker = first_from_each_requestuuid.groupby(first_from_each_requestuuid["Broker_name"])

MAX_TIME_DIFFERENCE = 200000 # How many microseconds

filtered_by_broker = {}
# For every order find all orders in time difference of 1 second.
duplicate_count = 0
for group in grouped_by_broker:
    previous_row = None
    in_difference = []
    # Group by all other columns now..
    all_other = group[1].groupby(["pickup_timestamp", "return_timestamp", "driver_age", 
          "source_country_region_id", "pickup_desk_id", "return_desk_id", "rental_days"])
    filtered_rows = []
    for group_other in all_other:
        SAME_PARAMETERS = group_other[1].sort_values(by="timestamp")
        # Now from inside group find merge which ones have similar timestamp (Need to set an E - 0.5s?) Consult on this 
        previous_row = None
        for row in SAME_PARAMETERS.itertuples():
            if previous_row:
                # If time difference is enough then add the row.
                if (row[1] - previous_row[1]).microseconds > MAX_TIME_DIFFERENCE:
                    filtered_rows.append(row)
                else:
                    duplicate_count += 1
            else:
                # We add first one no matter what.
                filtered_rows.append(row)
            previous_row = row
    filtered_by_broker[group[0]] = filtered_rows

    
print("Removed {} duplicates".format(duplicate_count))
Removed 51194 duplicates

Check how many removed

In [27]:
print("Before removing duplicates in this phase")

for group in grouped_by_broker:
    print(group[0], len(group[1]))

print("\nAfter removing duplicates")
    
for k,v in filtered_by_broker.items():
    print(k, len(v))
Before removing duplicates in this phase
BSP Auto 1577
Flexible Autos 150
Online Republic 30
RateChain test 55
Skyscanner 8
SupplierWebsite 904
TravelJigsaw 128177
Zuzuche 14

After removing duplicates
BSP Auto 1484
Flexible Autos 132
Online Republic 20
RateChain test 55
Skyscanner 8
SupplierWebsite 868
TravelJigsaw 77144
Zuzuche 10
In [28]:
filtered_df = pd.DataFrame([item for sub in filtered_by_broker.keys() for item in filtered_by_broker[sub]])

print("Before removing any duplicates:", before_count, "After:", len(filtered_df))
print("Decrease of {}%".format((100-(len(filtered_df)/before_count)*100)))
Before removing any duplicates: 2997586 After: 79721
Decrease of 97.34049331695572%

Saving to file

In [29]:
filtered_df.to_csv("new_rate_quote_1.csv")

Conclusions

After just remoing duplicates by request_uuid we got a decrease of ~95%. After also doing some additional removal by checking nearby timestamps and merging rows we got a total decrease of 97.3%. But since the last part might have false positives or negatives, which changes by changing MAX_TIME_DIFFERENCE.

After filtering (removing duplicates and unnecessary columns), a price quotes datafile of size 23 GB was reduced to 10,000 KB file.

Data Analysis

Number of requests per day

In [66]:
# requests per datetime
df_requests = pd.DataFrame(df["timestamp"])
requests_per_day = df_requests.groupby(df_requests["timestamp"].dt.date).count()
print(requests_per_day)
requests_per_day.plot.bar()
            timestamp
timestamp            
2017-12-01    1025955
2017-12-02     927905
2017-12-03    1043726
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ce9f2714e0>

Number of requests per broker

In [67]:
filtered_per_contract = filtered_df.groupby(filtered_df["Broker_name"])
reqs_separately = merged_requests[["timestamp", "Broker_name"]]
requests_per_contract_ = reqs_separately.groupby(reqs_separately["Broker_name"])

requests_per_contractor = requests_per_contract_.count().plot(kind="bar", title="Request count by contractor")
requests_per_contractor.set_ylabel("Request count")


requests_per_contractor = reqs_separately.groupby(reqs_separately["Broker_name"]).count().plot(kind="bar", title="Request count by contractor in log scale")
requests_per_contractor.set_yscale("log", nonposy='clip')
requests_per_contractor.set_ylabel("Request count")
Out[67]:
Text(0,0.5,'Request count')

Requests per day per contractor

In [70]:
contractors = []
for group in requests_per_contract_:
    time_contract = pd.DataFrame(group[1]["timestamp"])
    contractors.append((group[0], time_contract.groupby(time_contract["timestamp"].dt.date).count()))
In [71]:
contractors_2 = []
for group in filtered_per_contract:
    time_contract = pd.DataFrame(group[1]["timestamp"])
    contractors_2.append((group[0], time_contract.groupby(time_contract["timestamp"].dt.date).count()))
    plt = time_contract.groupby(time_contract["timestamp"].dt.date).count().plot(
        kind="bar", title="Requests per day for contractor on filtered data " + str(group[0])
    )
    plt.set_ylabel("Request count")
    plt.set_xlabel("Request timestamp")

Differences between Before and After removing duplicates

In [141]:
l_new = []
for con in contractors_2:
    for row in con[1].itertuples():
        l_new.append([con[0], row[0], row[1]])

df_contractor_dreq_new = pd.DataFrame(l_new)
df_contractor_dreq_new.columns = ["contractor", "date", "count"]

l = []
for con in contractors:
    for row in con[1].itertuples():
        l.append([con[0], row[0], row[1]])

df_contractor_dreq = pd.DataFrame(l)
df_contractor_dreq.columns = ["contractor", "date", "count"]


a = df_contractor_dreq.groupby("contractor").sum()
b = df_contractor_dreq_new.groupby("contractor").sum()
result = pd.concat([a,b], axis=1)
result.columns = ["before", "after"]
result["contractor"] = result.index
print(result)

melted=pd.melt(result, id_vars=["contractor"])

import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.barplot(x="contractor", y='value', hue="variable", data=melted, log=True)
ax.set(xlabel='Broker', ylabel='Number of requests', title="Number of requests before and after removing duplicates (log scale)")
plt.ylim(1, 10e6)
                  before  after       contractor
contractor                                      
BSP Auto           37702   1484         BSP Auto
Flexible Autos      3586    132   Flexible Autos
Online Republic      698     20  Online Republic
RateChain test      1260     55   RateChain test
Skyscanner            95      8       Skyscanner
SupplierWebsite    18834    868  SupplierWebsite
TravelJigsaw     2935201  77144     TravelJigsaw
Zuzuche              210     10          Zuzuche
Out[141]:
(1, 10000000.0)

Per contractor, how many requests per day

In [139]:
# For each contractor see how many requests per day..
contractors_dups = []
for group in df_dups.groupby(df_dups["Broker_name"]):
    time_contract = pd.DataFrame(group[1]["timestamp"])
    contractors_dups.append((group[0], time_contract.groupby(time_contract["timestamp"].dt.date).count()))

    
l_dups = []
for con in contractors_dups:
    for row in con[1].itertuples():
        l_dups.append([con[0], row[0], row[1]])

        
df_con_dups = pd.DataFrame(l_dups)
df_con_dups.columns = ["contractor", "date", "count"]

import seaborn as sns
import matplotlib.pyplot as plt
plt.ylim(10, 40)

sns.barplot(x='contractor', y='count', hue='date', data=df_con_dups, log=True)
plt.ylim(1, 10e6)
Out[139]:
(1, 10000000.0)

For each day, when are requested pickup dates

On data w/o duplicates.

On here we can also see an unexpected peak - February 7th, which is trending on several days.

In [ ]:
# group data by day

daylist = []
days = []
for group in df.groupby(df["timestamp"].dt.date):
    days.append(group[0])
    daylist.append(group[1])

    
In [88]:
for i, day in enumerate(daylist):
    day_pickup = pd.DataFrame(day["pickup_timestamp"])
    plt = day_pickup.groupby(day_pickup["pickup_timestamp"].dt.date).count()
    data = [go.Scatter(x=plt.index, y=plt["pickup_timestamp"])]
    
    layout = go.Layout(
    title='Number of requests per pickup dates for ' + str(days[i]),
        yaxis=dict(
        title='Count of requests'
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

Number of requests in time

In [95]:
requests = df_requests.reset_index().set_index("timestamp")
resampled_requests = requests.resample("10Min").count()

title = "Number of requests in time in period from 1st December to end of 3rd December  (interval 10 min)"
data = [go.Scatter(x=resampled_requests.index, y=resampled_requests["index"])]

layout = go.Layout(
title=title,
    yaxis=dict(
    title='Count of requests'
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

For each day separately

In [96]:
for i, day in enumerate(daylist):
    reqs = pd.DataFrame(day["timestamp"]).reset_index().set_index("timestamp").resample("5Min").count()
    
    data = [go.Scatter(x=resampled_requests.index, y=resampled_requests["index"])]

    layout = go.Layout(
    title="Number of requests on " + str(days[i]) + " (interval 5 min)",
        yaxis=dict(
        title='Count of requests'
        )
    )

    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

For each request we have pickup date and return date. Lets assume we have a request with pickup_date 27.11.2017 and return date 31.11.2017, then such request will add one value to each following day 27,28,29,30,31 (every day inbetween and including 27 and 31).

With this we can get some insight to which dates would have probably the largest number of active reservations. From here we can also see the unexpected peaks in the data - which dates are more interesting to people - unexpected peaks will be dates which have way more interest than neighbouring ones.

For example on the following graphs we can see that 3rd January is unexpectedly popular. Might be due to some conference being around that time.

Faster algorithm

To solve this we solve problem: Maximum number of overlapping intervals. Source: https://www.geeksforgeeks.org/find-the-point-where-maximum-intervals-overlap/

First we sort both pickup and return times by date. Then we iterate over all days between first pickup_date and last return_date.

For each day we count how many pickup times are on that day and how many returns are on that day. Each pickup on that day means that there is one more active reservation during that day and one return means that there is one less.

Also to make this considerably faster you can run this on data without duplicates, since distributions should remain the same (by broker)

Make use of filtered data

In [112]:
daylist = []
days = []
for group in filtered_df.groupby(filtered_df["timestamp"].dt.date):
    days.append(group[0])
    daylist.append(group[1])
In [105]:
def count_active_on_day(df):
    # Algorithm to find how many requests are made for a certain day. 
    # Accepts a dataframe with pickup dates and return dates (columns pickup_timestamp and return_timestamp)
    # Effective version based on this: https://www.geeksforgeeks.org/find-the-point-where-maximum-intervals-overlap/
    # Returns a dictionary with key being date and values being how many requests for that date. 

    pickup_sort = df["pickup_timestamp"].dt.normalize().sort_values().reset_index(drop=True)
    return_sort = df["return_timestamp"].dt.normalize().sort_values().reset_index(drop=True)
    
    start = pickup_sort[0]
    end = return_sort[len(df)-1]
    pickup_ind = 0 # which index are we currently on?
    return_ind = 0
    opened_bracket_count = 0
    counts_new = {}
    length = len(df)

    # For every date between first pickup_rate and last return_date 
    # find how many open reservations would be based on price requests.
    for date in pd.date_range(start, end, normalize=True):
        ## this means that a reservation ends on this day
        while opened_bracket_count >= 1 and return_sort[return_ind] < date:
            ## Remove one
            opened_bracket_count -= 1
            ## Move one index forward in return time array
            return_ind += 1
        ## This means that reservation starts on this day
        while pickup_ind < length and pickup_sort[pickup_ind] <= date:
            ## Add one
            opened_bracket_count += 1
            pickup_ind += 1
        counts_new[date] = opened_bracket_count

    return counts_new
In [113]:
from collections import defaultdict

# Do the same, now using faster algorithm.
day_counts = []
for day in daylist:
    day_counts.append(count_active_on_day(day))
In [115]:
for i in range(3):
    time_series_1 = pd.DataFrame(pd.Series(day_counts[i]))
    time_series_1.index.name = "timestamp"
    time_series_1.columns = ["count"] 
    title="How many times a day is between pickup and return date (on filtered data) " + str(days[i])
    df1=time_series_1.groupby(pd.Grouper(freq="D")).sum()
    
    data = [go.Scatter(x=df1.index, y=df1["count"])]

    layout = go.Layout(
    title=title,
        yaxis=dict(
        title='Count of occurences'
        )
    )

    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

Hot dates by broker

Lets separate them by broker, but combine days.

We can see more on unexpected peaks - seems there is unexpected interest on early January and February 16

In [147]:
print(len(filtered_df))
print(len(df_dups))
79721
53416
In [121]:
# For each contractor find hot dates.

pickups_per_broker = filtered_df[["pickup_timestamp", "return_timestamp", "Broker_name"]]

broker_dict = {}
for group in pickups_per_broker.groupby(pickups_per_broker["Broker_name"]):
    broker_dict[group[0]] = count_active_on_day(group[1])


for k,v in broker_dict.items():
    time_series_1 = pd.DataFrame(pd.Series(v))
    time_series_1.columns = ["count"] 

    df1 = time_series_1.groupby(pd.Grouper(freq="D")).sum()
    data = [go.Scatter(x=df1.index, y=df1["count"])]

    layout = go.Layout(
     title="Hottest days for "+ k +" by day (on request data)",
        yaxis=dict(
        title='Count of occurences of the day between all pickup and return times'
        )
    )

    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
    
    

Separated by day

In [149]:
# For each contractor find hot dates.

import cufflinks as cf
cf.go_offline()

fields = ["timestamp", "pickup_timestamp", "return_timestamp", "Broker_name"]
df_dups = filtered_df

df_dups["timestamp"] = pd.to_datetime(df_dups['timestamp'])
df_dups["pickup_timestamp"] = pd.to_datetime(df_dups['pickup_timestamp'])
df_dups["return_timestamp"] = pd.to_datetime(df_dups['return_timestamp'])


broker_dict_dups = defaultdict(dict)
for group in df_dups.groupby(df_dups["Broker_name"]):
    for g in group[1].set_index("timestamp").groupby(pd.Grouper(freq="D")):
        broker_dict_dups[group[0]][g[0]] = count_active_on_day(g[1])

brokers = df_dups["Broker_name"].unique()

for broker in brokers:
    k = broker
    v = broker_dict_dups[k]
    data = pd.DataFrame(v)
    #  print(data)
    data.iplot(kind='scatter', filename='cufflinks/cf-simple-line', title=k + " requests hot days on filtered data", yTitle="Count of occurences", xTitle="Date")

Analysis on Reservation Data

In [99]:
fields = ["rate_request_timestamp", "reservation_request_timestamp", "pickup_timestamp", "return_timestamp"]
df = pd.read_csv("reservations.csv", skipinitialspace=True, usecols=fields)
print(df.shape)

df["rate_request_timestamp"] = pd.to_datetime(df['rate_request_timestamp'])
df["reservation_request_timestamp"] = pd.to_datetime(df['reservation_request_timestamp'])
df["pickup_timestamp"] = pd.to_datetime(df['pickup_timestamp'])
df["return_timestamp"] = pd.to_datetime(df['return_timestamp'])
print(df.dtypes)
(9747, 4)
pickup_timestamp                 datetime64[ns]
return_timestamp                 datetime64[ns]
rate_request_timestamp           datetime64[ns]
reservation_request_timestamp    datetime64[ns]
dtype: object

Number of Reservations made on date

To see how many requests were made on a certain date

In [131]:
df_reservation_requests = df["reservation_request_timestamp"]
reservation_requests = df_reservation_requests.reset_index().set_index("reservation_request_timestamp")
resampled_requests = reservation_requests.resample("1D").count()
data = [go.Scatter(x=resampled_requests.index, y=resampled_requests["index"])]


title="Number of reservation requests in time"

layout = go.Layout(
 title=title,
    yaxis=dict(
    title='Count of occurences'
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Pickup times on reservation data

Note that the data since December is in the future and before that is in the past. We'll look at this later.

In [130]:
df_pickup_times = df["pickup_timestamp"]
pickup_times = df_pickup_times.reset_index().set_index("pickup_timestamp")
resampled_requests = pickup_times.resample("1D").count()


data = [go.Scatter(x=resampled_requests.index, y=resampled_requests["index"])]
title="Number of pickup times on a date (daily) in reservation data"


layout = go.Layout(
 title=title,
    yaxis=dict(
    title='Count of occurences'
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Hot dates on confirmed reservations only

All in the future. From here we can see for example that february 18 is unexpectedly popular.

In [137]:
fields = ["rate_request_timestamp", "reservation_request_timestamp", "pickup_timestamp", 
          "return_timestamp", "Broker name", "reservation_status_type"]
df_res = pd.read_csv("reservations.csv", skipinitialspace=True, usecols=fields)
print(df_res.shape)

df_res = df_res[df_res.reservation_status_type == "CONFIRMED"]

df_res["rate_request_timestamp"] = pd.to_datetime(df_res['rate_request_timestamp'])
df_res["reservation_request_timestamp"] = pd.to_datetime(df_res['reservation_request_timestamp'])
df_res["pickup_timestamp"] = pd.to_datetime(df_res['pickup_timestamp'])
df_res["return_timestamp"] = pd.to_datetime(df_res['return_timestamp'])
print(df_res.dtypes)

# For each contractor find hot dates.


broker_dict_res = {}
for group in df_res.groupby(df_res["Broker name"]):
    broker_dict_res[group[0]] = count_active_on_day(group[1])


for k,v in broker_dict_res.items():
    time_series_1 = pd.DataFrame(pd.Series(v))
    time_series_1.index.name = "timestamp"
    time_series_1.columns = ["count"] 
    df1=time_series_1.groupby(pd.Grouper(freq="D")).sum()
    title="Hottest days for "+ k +" by day (on CONFIRMED reservation data)"
    data = [go.Scatter(x=df1.index, y=df1["count"])]


    layout = go.Layout(
     title=title,
        yaxis=dict(
        title='Count of occurences'
        )
    )

    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

    
(9747, 6)
pickup_timestamp                 datetime64[ns]
return_timestamp                 datetime64[ns]
reservation_status_type                  object
rate_request_timestamp           datetime64[ns]
reservation_request_timestamp    datetime64[ns]
Broker name                              object
dtype: object

Hot dates on completed reservations (PAST)

In [138]:
fields = ["rate_request_timestamp", "reservation_request_timestamp", "pickup_timestamp", 
          "return_timestamp", "Broker name", "reservation_status_type"]
df_res = pd.read_csv("reservations.csv", skipinitialspace=True, usecols=fields)
print(df_res.shape)

df_res = df_res[df_res.reservation_status_type == "COMPLETED"]

df_res["rate_request_timestamp"] = pd.to_datetime(df_res['rate_request_timestamp'])
df_res["reservation_request_timestamp"] = pd.to_datetime(df_res['reservation_request_timestamp'])
df_res["pickup_timestamp"] = pd.to_datetime(df_res['pickup_timestamp'])
df_res["return_timestamp"] = pd.to_datetime(df_res['return_timestamp'])
print(df_res.dtypes)

# For each contractor find hot dates.


broker_dict_res = {}
for group in df_res.groupby(df_res["Broker name"]):
    broker_dict_res[group[0]] = count_active_on_day(group[1])


for k,v in broker_dict_res.items():
    time_series_1 = pd.DataFrame(pd.Series(v))
    time_series_1.index.name = "timestamp"
    time_series_1.columns = ["count"] 
    df1=time_series_1.groupby(pd.Grouper(freq="D")).sum()
    title="Hottest days for "+ k +" by day (on COMPLETED reservation data)"
    data = [go.Scatter(x=df1.index, y=df1["count"])]


    layout = go.Layout(
     title=title,
        yaxis=dict(
        title='Count of occurences'
        )
    )

    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
(9747, 6)
pickup_timestamp                 datetime64[ns]
return_timestamp                 datetime64[ns]
reservation_status_type                  object
rate_request_timestamp           datetime64[ns]
reservation_request_timestamp    datetime64[ns]
Broker name                              object
dtype: object

Percentage of Requests made for next day...

What percentage of requests are made to for next day? next 3 days? next 7 days? next 14 days? next 30 days? next 60 days? next 90 days? next 180 days? next 360 days?

In [154]:
fields = ["timestamp", "pickup_timestamp", "return_timestamp", "Broker_name"]
df_dups = pd.read_csv("rateq_dup.csv", skipinitialspace=True, usecols=fields)

print(df_dups.shape)
df_dups["timestamp"] = pd.to_datetime(df_dups['timestamp'])
df_dups["pickup_timestamp"] = pd.to_datetime(df_dups['pickup_timestamp'])
df_dups["return_timestamp"] = pd.to_datetime(df_dups['return_timestamp'])
print(df_dups.dtypes)

daylist = []
days = []

for group in df_dups.groupby(df_dups["timestamp"].dt.date):
    days.append(group[0])
    daylist.append(group[1])

    

cont_dict = defaultdict(dict)

day_offsets = [0, 1,3,7,14,30,60,90,180]
for daynr, day in enumerate(daylist):
    broker_dict_dups = {}
    for group in day.groupby(df_dups["Broker_name"]):
        broker_dict_dups[group[0]] = count_active_on_day(group[1])


    for k,v in broker_dict_dups.items():
        time_series_1 = pd.DataFrame(pd.Series(v))
        time_series_1.index.name = "timestamp"
        time_series_1.columns = ["count"] 
        summ = time_series_1.sum()


        day_counts = []
        for off in day_offsets:
            day_counts.append(time_series_1.loc[time_series_1.index.min():time_series_1.index.min()+pd.DateOffset(off)].sum())

        proportions = [float(d/summ) for d in day_counts]

        df_prop = pd.DataFrame(
            {"first" : day_offsets,
             "proportion" : proportions
            }) 
        
        cont_dict[k][daynr] = df_prop

        
for broker in brokers:
    dc = pd.DataFrame([cont_dict[broker][i]["proportion"] for i in cont_dict[broker].keys()]).reset_index().T
    dc = dc.drop(dc.index[0])
    dc = dc.infer_objects()
    if not dc.empty:
        dc.columns = ([str(days[i]) for i in cont_dict[broker].keys()])
        dc["days"] = [0,1,3,7,14,30,60,90,180]
        
        
        layout = go.Layout(
            title = "Broker " + broker + " proportions of requests of first days",
            titlefont=dict(
                    family='Courier New, monospace'
            ),
            xaxis=dict(
                showgrid=True,
                zeroline=True,
                showline=True,
                #ticks='',
                showticklabels=True,
                title="Number of days ahead",
                range=[0,185]
            ),
            yaxis=dict(
                showgrid=True,
                zeroline=True,
                showline=True,
                #ticks='',
                showticklabels=True,
                range = [0,1],
                title="Proportion of requests"
            )
        )
        
        data = [go.Scatter(x=dc["days"],y=dc[day], name=day) for day in dc.columns[:-1]]
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)
        
        
(456182, 4)
timestamp           datetime64[ns]
pickup_timestamp    datetime64[ns]
return_timestamp    datetime64[ns]
Broker_name                 object
dtype: object

Discussion

We can see that on most of the days 25% of requests request cars inside upcoming 30 days and half of the requests are within first 90 days. Noting that the derivative of the above line will get smaller as we we go forward. This means that there is more interest in earlier days than days after.

Caching

How many price requests could be cached (have same pickup and return location, source country and driver age) for 1 hour, 3 hours, 12 hours? Per contractor.

We will use initial data for these tasks.

Investigating maximum cacheability.

From request data see how many responses are exactly the same. For this we'll use

To do this we check how many duplicates there are in "Charge_log" column. Charge_log column is the response to the request. This gives somewhat of an idea of the problem.

In [30]:
fields = ["charge_log"]

df_cache = pd.read_csv("rate_quote.csv", skipinitialspace=True, usecols=fields)
In [31]:
# Drop duplicates
dropped = df_cache.drop_duplicates("charge_log")
In [32]:
print("Before removing any duplicates:", len(df_cache), "After:", len(dropped))
print("Decrease of {}%".format((100-(len(dropped)/len(df_cache))*100)))
Before removing any duplicates: 2997586 After: 26226
Decrease of 99.12509599390977%

Number of duplicate responses

We see that 99.1% of responses are the same. This of course should be taken with a grain of salt, as there can be a lot of coincidences in pricing between cars with different models et cetera. We can see that there are 26226 different responses in first data.

Caching requests by 1 hour, 3 hours and 12 hours

As we talked with the client we were told that there are following factors, which are used in creating a price response (consoidering that the market prices of cars havent changed in meanwhile):

  • pickup time,

  • rental duration,

  • pickup and return location,

  • source country,

  • driver age.

We looked at three different scenarios (for each contract separately and then merged by broker):

First we only looked at how many requests have same source, pickup, return, driver age.

Secondly we looked what would be if in addition return_timestamp and pickup_timestamp would also be exactly the same

Thirdly we looked that pickup day, driver age, pickup location, drop location, rental days length


In this notebook we only show the last one, the others are also in caching notebook separately.

Solution

For this I wrote function cached_by_fields.

Practically what it does is for each cache interval (60 minutes, 180 minutes, 720 minutes), in every 30 minutes it finds how many of requests nade in last 30 minutes could be cached according to those criteria.

Note that for faster calculations we make the assumption that cache is updated every 30 minutes. Not that this isn't totally accurate representation because there could be more duplicate requests in past 30 minutes.

This means that if we are looking at requests between 12:30-13:00 then in cache there are (12:30 - 60/180/720 minutes) - 12:30

In [13]:
# Use data without duplicates

fields = ["timestamp", "pickup_timestamp", "return_timestamp", "broker_contract_id", 
          "source_country_region_id", "driver_age", "pickup_desk_id", "return_desk_id", "rental_days",
         "Broker_name"]
df_dups = pd.read_csv("new_rate_quote_1.csv", skipinitialspace=True, usecols=fields)
print(df_dups.shape)

df_dups["timestamp"] = pd.to_datetime(df_dups['timestamp'])
df_dups["pickup_timestamp"] = pd.to_datetime(df_dups['pickup_timestamp'])
df_dups["return_timestamp"] = pd.to_datetime(df_dups['return_timestamp'])
(79721, 10)
In [62]:
from collections import defaultdict
def cached_by_fields(dup_fields, merged_requests):
    """
    dup_fields: list of fields which ahve to be the same for it to be "cached".
    merged_requests: dataframe to work on.
    
    returns merged_datas, which is a list of dictionaries for each time interval (60 min, 180 min, 720 min)
    Each of those dictionaries has keys of contractors which in accordance refers to a dictionary with keys being
    date.
    
    Note that there are many places to optimize in here for faster performance.
    """
    ## Set index to be timstamp
    
    df_new_dups_2 = merged_requests.set_index("timestamp")
    ## Get interesteing columns
    df_new_dups_2 = df_new_dups_2[dup_fields]
    ## Normalize pickup_timestamps by hour. 
    df_new_dups_2["pickup_timestamp"] = df_new_dups_2["pickup_timestamp"].dt.normalize()
    ## Find which brokers in data
    brokers = merged_requests["Broker_name"].unique()
    
    merged_datas = []
    
    # For 60 minutes, 180 minutes and 720 minutes.
    for frq in [60,180,720]:
        merged_data_3 = defaultdict(dict)
        ## For each broker/contractor in data
        for contractor in brokers:
            ## Select rows for the broker name
            df_contractor = df_new_dups_2[df_new_dups_2["Broker_name"] == contractor]
            if not df_contractor.empty:
                ### Sort contractor data by time
                df_contractor = df_contractor.sort_index()
                date_range = pd.date_range(df_contractor.index.min(), df_contractor.index.max(), freq="30Min")
                for i in range(len(date_range)-1):
                    ## Find rows which are between two dates (Those requests are the ones that are cached)
                    cached_requests = df_contractor[date_range[i]-pd.Timedelta(minutes=frq):date_range[i]]
                    ## From between_dates find current requests.
                    current_requests = df_contractor[date_range[i+1]-pd.Timedelta(minutes=30):date_range[i+1]]
                    ## How many of current requests are not in the cache:
                    not_in_cache =  pd.merge(current_requests, 
                                             cached_requests, how='outer', indicator=True)
                    not_in_cache = not_in_cache[not_in_cache["_merge"] == "left_only"]
                    
                    ## How many requests made last 30 minutes
                    total = len(current_requests)
                    ## How many non-duplicates between two dates
                    non_dup_count = len(not_in_cache)
                    cached = total - non_dup_count
                    ## What is the proportion
                    proportion = 0 if total == 0 else cached/total
                    ## Add the sum of cached, total requests and proportion to dictionary
                    merged_data_3[contractor][date_range[i+1]] = [cached, total, proportion]

        merged_datas.append(merged_data_3)
    

    return merged_datas
In [63]:
dup_fields = ["driver_age", "source_country_region_id", "broker_contract_id", 
              "pickup_desk_id", "return_desk_id", "pickup_timestamp", "rental_days", "Broker_name"]

merged_datas = cached_by_fields(dup_fields, merged_requests)

Caching in graphs

First one is with absolute number, the second one is with proportions.

In [59]:
for broker in merged_requests["Broker_name"].unique():
    tJigs = []
    for merged_data in merged_datas:
        tJig = pd.DataFrame(merged_data[broker]).T
        if tJig.empty:
            continue
        tJigs.append(tJig)


    layout = go.Layout(
        title = broker + " absolute number of requests cached in 30 minute intervals with different cache lengths",
        titlefont=dict(
                family='Courier New, monospace'
        ),
        xaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Date",
        ),
        yaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Absolute count of cacheable requests"
        )
    )
    data = []
    try:
        data = [go.Scatter(x=tJigs[0].index, y=tJigs[0][0], name="1 hour"), 
                go.Scatter(x=tJigs[1].index, y=tJigs[1][0], name="3 hours"), 
                go.Scatter(x=tJigs[2].index, y=tJigs[2][0], name="12 hours")]
        
    except:
        pass
    if data:
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)
In [60]:
for broker in merged_requests["Broker_name"].unique():
    tJigs = []
    for merged_data in merged_datas:
        tJig = pd.DataFrame(merged_data[broker]).T
        if tJig.empty:
            continue
        tJigs.append(tJig)


    layout = go.Layout(
        title = broker + " proportion of requests cached in 30 minute intervals with different cache lengths",
        titlefont=dict(
                family='Courier New, monospace'
        ),
        xaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Date",
        ),
        yaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Proportion of cacheable requests"
        )
    )
    data = []
    try:
        data = [go.Scatter(x=tJigs[0].index, y=tJigs[0][2], name="1 hour"), 
                go.Scatter(x=tJigs[1].index, y=tJigs[1][2], name="3 hours"), 
                go.Scatter(x=tJigs[2].index, y=tJigs[2][2], name="12 hours")]
        
    except:
        pass
    if data:
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)

Discussion of results

First of all we can see that for brokers with low activity, using a timed cache might not really be a good idea as the proportion of cached requests is almost 0. For TravelJigsaw we can see that actually the proportion of cached requests can get up to around 50% at some periods.

Anyways, since there are different number of requests dependent on the time of the day then using only time as caching criteria might not be good idea. Could also set some lower limit on the cache row count.

Note also that this is an approximation of what would've happened in real, since in this case requests which are in cache are in previous timeframe. Ex if we are looking at requests from 12:30 to 13:00 then requests which are in cache are for 60 minute case 11:30 to 12:30.